Abstract

Simultaneous interrogation of tumor genomes and transcriptomes is underway in unprecedented global efforts. 
Yet, despite the essential need to separate driver mutations modulating gene expression networks from 
transcriptionally inert passenger mutations, robust computational methods to ascertain the impact of individual 
mutations on transcriptional networks are underdeveloped. We introduce a novel computational framework, 
DriverNet, to identify likely driver mutations by virtue of their effect on mRNA expression networks. Application to 
four cancer datasets reveals the prevalence of rare candidate driver mutations associated with disrupted 
transcriptional networks and a simultaneous modulation of oncogenic and metabolic networks, induced by copy 
number co-modification of adjacent oncogenic and metabolic drivers. DriverNet is available on Bioconductor or at 
http://compbio.bccrc.ca/software/drivernet/. 
Background

Cancer genome sequencing experiments are designed to 
enumerate all somatic mutations within a cancer. Some of 
these mutations will serve as actionable genomic aberra- 
tions upon which to develop and apply targeted therapies 
(for example, mutations in PIK3CA, BRAF, and KRAS) 
and will ultimately enable rational frameworks for improved
clinical management and patient care based on precise 
genomic patterns of somatic alteration. To this end, next 
generation sequencing (NGS) technology has shifted the 
rate-limiting step from identifying all cancer mutations in 
a sequenced genome to identifying the relatively few func- 
tional mutations that drive the phenotype of malignant 
cells. Therein lies a major challenge in the cancer geno- 
mics field: distinguishing pathogenic, driver mutations 
from the so-called passenger mutations that accrue sto- 
chastically, but do not confer selective advantages. 
In order to discover novel driver mutations, several
large-scale sequencing initiatives such as The Cancer Genome
Atlas project (TCGA, for example, [1]) are generating
simultaneous whole genome and transcriptome interroga- 
tions for hundreds of cases of the same tumor type. This 
opens the possibility of ascribing the impact of individual 
somatic mutations on gene expression networks. Initial 
observations in high-throughput datasets, coupled with 
innumerable functional studies suggest that driver muta- 
tions are expected to alter gene expression of their cognate 
proteins, their interacting partners, or genes that share the 
same biochemical pathway. This will lead to a correlated 
pattern of gene expression in a network of genes asso- 
ciated with a driver mutation, which differs from benign 
passenger mutations with little to no phenotype. More- 
over, somatic aberrations in genes may alter more than 
one transcriptional network, thus enabling the enumera- 
tion of a group of pathways driven by a single genomic 
event. The importance of placing mutations in the context 
of their gene expression has been illuminated recently by 
Prahallad and colleagues [2], who established the thera- 
peutic effect of PLX4032 against the BRAF V600E onco- 
protein, which is mechanistically linked to the activation 
of EGFR. Thus, differential expression of EGFR in different
cell types (colon cancers versus melanomas) has a dra- 
matic impact on drug efficacy. Consequently, knowing 
active pathways coupled with mutational profiles will be 
critical for implementation of therapeutic decisions 
informed by the presence of mutations in a cancer. 

Current approaches for driver analysis typically rely on 
the frequency of aberration of a given gene or locus in a 
population of tumors as a function of the background 
mutation rate (for example, [3-5]). Recent whole gen- 
ome interrogations, however, have revealed the vast 
majority of mutated genes exhibit low population fre- 
quencies [6-10]. While most of these events can be 
explained by stochastically acquired mutations due to 
increased proliferation or acquisition of mutagenic pro- 
cesses, with no oncogenic properties, many others are in 
fact well-known pathogenic mutations with, in some 
cases, actionable clinical utility. For example, sequencing 
of complete exomes of 316 ovarian cancers [7] and 65 
triple negative breast cancers [11] revealed rare but 
functionally important and actionable mutations (for 
example, in ERBB2 and BRAF) in a small percentage of 
cases that were not identified by frequency and back- 
ground mutation rate analyses. Thus, frequency analysis 
will fail to recognize infrequent, but nonetheless impor- 
tant driver mutations. 

We suggest that integrative analysis of genomic aberra- 
tions and transcriptional profiles in cancer will reveal 
somatic mutations that drive biological processes, regard- 
less of the population frequency. Furthermore, we propose 
that biological networks can be leveraged to relate muta- 
tions to their consequent effect on transcription and gene 
expression. Figure 1A shows an example of high-level 
amplification of EGFR in a glioblastoma multiforme 
(GBM) tumor, accompanied by the coincident outlying 
expression of genes that are connected to EGFR through 
known biological pathways. We note that BRAF in this 
case, although not amplified itself, exhibits elevated 
expression compared to the population distribution. Other 
genes known to interact with EGFR exhibit similar 
extreme changes in expression levels in this example, such 
that PI3K signaling and MAPK signaling could be affected 
by this single genomic event. Figure 1B shows fitted Gaussian
expression distributions of three genes that interact
with EGFR: FGF11, PIK3R1, and PRKACB, and shows that 
some cases with outlying expression have coincident 
EGFR amplifications. Our assumption is that amplification 
of EGFR in these cases has driven expression of the exam- 
ple genes to the tails of their respective distributions. 
Thus, extreme changes in expression levels of genes 
related to genomic aberrations are observable in orthogon- 
ally measured high-throughput transcriptome assays. As 
such, simultaneous analysis of genome and transcriptome 
measurements should amplify important signals in the
data. Motivated by this idea, we hypothesize that driver
aberrations will measurably disrupt transcriptional profiles 
regardless of their frequency in the population. 
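
The notion of outlying expression used above can be illustrated with a minimal sketch. It assumes, as in Figure 1B, that the expression of one gene across the population is summarized by a fitted Gaussian, and it flags cases lying in the tails. The function name `outlying_cases` and the cut-off `n_sd` (two standard deviations) are hypothetical choices made for illustration; the text does not specify the threshold.

```python
import statistics


def outlying_cases(expression, n_sd=2.0):
    """Return the cases whose expression lies in the tails of a fitted Gaussian.

    expression: dict mapping case identifier -> expression value for one gene
    n_sd: cut-off in standard deviations (an assumption, not taken from the text)
    """
    values = list(expression.values())
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation; needs >= 2 cases
    if sd == 0:
        # No variation in the population, so no case can be an outlier.
        return set()
    return {case for case, x in expression.items() if abs(x - mean) > n_sd * sd}
```

Applied gene by gene, such a function would yield the patient-gene outlier events that are later related to genomic aberrations.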

Algorithmic frameworks to exploit the relationship 
between genomic events and consequent changes in gene 
expression to nominate putative driver genes are underde- 
veloped. We therefore propose an integrated genome/ 
transcriptome analysis framework, called DriverNet, to 
contextualize genomic aberrations (for example, mutations 
and copy number alterations) by their effect on transcrip- 
tional networks and identify candidate genomic aberra- 
tions suitable for functional experimental follow-up. Our 
approach allows individual mutations to be related to 
coincident changes in gene expression and assigns statisti- 
cal significance to candidate predictions, thus quantita- 
tively and rationally prioritizing candidate genes. We note 
that our intent differs from complementary approaches 
such as the one described by Vaske et al. [12], which aims 
at nominating driver pathways rather than driver genes in 
cancer, and from those that leverage genome data without 
considering expression [4,13]. Both Masica and Karchin 
[14] and Ciriello et al. [15] integrate genome and tran- 
scriptome relationships in their framework; however, they 
differ from our approach, since Masica and Karchin [14] 
do not utilize known biological pathway information and 
Ciriello et al. [15] only consider mRNA expression asso- 
ciated with copy number aberrations and not with muta- 
tions. Other methods focusing on copy number and 
expression associations do not consider mutations, nor do 
they employ the use of previously annotated pathways 
[16,17]. 

To study the properties and advantages of our approach, 
we analyzed four large-scale genome-transcriptome inter- 
rogations of tumor populations (Table 1) in human glio- 
mas, triple negative breast cancers, a population of nearly 
1,000 breast tumors (all subtypes) and high-grade serous 
ovarian cancers. We present results from three experi- 
ments: i) ascertainment of sensitivity and specificity in the 
context of several cancer datasets; ii) enumeration of well- 
known, but infrequent, drivers modulating transcriptional 
networks, and iii) identification of complex driver events 
that implicate compound metabolic and oncogenic path- 
way modulation from single genomic events. 

Results 

Overview of DriverNet approach 

We developed a novel, integrated algorithmic approach 
(DriverNet) to analyze population-based genomic and 
transcriptomic interrogations of tumor (sub)types for iden- 
tification of pathogenic driver mutations. Our approach 
relates genomic aberrations to disrupted transcriptional 
patterns, informed by known associations or interactions 
between genes. The full details of the algorithm are 
described in the Online Methods, but will be summarized 
Overrepresented pathways of genes exhibiting outlying expression associated with EGFR high-level amplification:
PI3K signaling: FOXO1, FOXO3, AKT1, AKT3, GNAI3, SOS1, PIK3R5, PIK3R1, GSK3B, JAK1
MAPK signaling: PDGFA, FGF10, AKT1, MAP2K7, AKT3, EGFR, BRAF, FLNA, JUN, PPM1A, KRAS, ELK4, SOS1, JUND, NFATC4, RASA2, TAOK1, RPS6KA2, RAP1A, CACNA1H
ErbB signaling pathway: BTC, STAT5B, AKT1, MAP2K7, AKT3, EGFR, BRAF, JUN, SRC, KRAS, SOS1, CAMK2D, PIK3R5, PIK3R1, PLCG1, GSK3B
Figure 1 A schematic showing how DriverNet works. (a) An example of a Cytoscape visualization of a glioblastoma patient with a high-level
amplification of epidermal growth factor receptor (EGFR) (shown in green) and coincident outlying expression of genes connected to EGFR in the
Reactome influence graph (shown in yellow). Examples of the overrepresented pathways (by Reactome FI plug-in for Cytoscape, FDR < 0.001) from
the list of genes showing outlying expression associated with the EGFR amplification are depicted at the bottom. The box plot shows the
population-level expression distribution of BRAF, an interacting protein with EGFR, and where the specific case with EGFR amplification sits on that distribution
(red Y). We note that in this case, BRAF itself is not mutated or amplified. (b) Fitted Gaussian expression distributions of three genes that interact with
EGFR: FGF11, PIK3R1, and PRKACB, with each point indicating the probability density function for individual cases. For each gene, blue dots indicate
cases with mutations in the gene itself and red arrows indicate cases with outlying expression with coincident EGFR amplifications. (c) Schematic
representation of the DriverNet approach. Given the genomic aberration states for different patients and genes, gene expression data, and the
influence graph, which captures biological pathway information, the bipartite graph shown on the right is constructed. Green nodes on the left
partition of the bipartite graph correspond to aberrated genes and nodes on the right represent the outlying expression status for each patient where
red indicates outlying patient-gene events from the gene expression matrix. The genes with the highest number of outlying expression events (for
example, g_2) are nominated as putative drivers.
Table 1 Description of datasets

Dataset   Tumor type              Number of cases  Genomic aberrations  Outliers  Reference
GBM       glioblastoma            120              3,198                26,956    [6]
GBM2      glioblastoma            140              573                  35,618
METABRIC  breast                  997              18,331               214,530   [19]
TN        triple negative breast  66               4,824                15,929    [11]
TN2       triple negative breast  66               1,019                15,929
HGS       serous ovarian          304              8,229                91,697    [7]
HGS2      serous ovarian          307              4,919                92,491

here in brief. Shown schematically in Figure 1C, DriverNet 
formulates associations between mutations and expression 
levels using a bipartite graph where nodes are: i) the set of 
genes representing the mutation status (the left partition 
of the graph) and ii) the set of genes representing outlying 
expression status in each of the patients (the right partition 
of the graph). For each patient, an edge between the nodes 
on the left and right partitions of the graph is drawn if the 
following three conditions are all satisfied: i) gene g_i is
mutated in patient p of the population (green nodes on
the left partition of the graph); ii) gene g_j shows outlying
expression in patient p (red nodes on the right partition of
the graph); and iii) g_i and g_j are known to interact according
to pathway or gene set databases (an 'influence graph'
after [18]). Our method then uses a greedy optimization 
approach to explain as many nodes on the right partition 
of the bipartite graph as possible using the fewest number 
of nodes on the left partition of the graph such that the 
genes explaining the highest number of outlying expression
events (for example, g_2 in Figure 1C) are nominated
as putative driver genes. Finally, we apply statistical signifi- 
cance tests to these candidates based on null distributions 
informed by stochastic resampling. 
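
The bipartite formulation and greedy selection described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the DriverNet implementation: the function names (`build_bipartite`, `greedy_drivers`), the dictionary-based inputs, and the alphabetical tie-breaking are all hypothetical, and the significance testing by stochastic resampling is omitted.

```python
from collections import defaultdict


def build_bipartite(mutations, outliers, influence):
    """Map each mutated gene to the (patient, gene) outlier events it may explain.

    mutations: dict patient -> set of mutated genes
    outliers:  dict patient -> set of genes with outlying expression
    influence: dict gene -> set of genes it is known to interact with
    """
    edges = defaultdict(set)
    for patient, mutated in mutations.items():
        patient_outliers = outliers.get(patient, set())
        for g_i in mutated:
            # An edge needs all three conditions: g_i mutated in this patient,
            # g_j outlying in this patient, and g_i, g_j linked in the graph.
            for g_j in influence.get(g_i, set()) & patient_outliers:
                edges[g_i].add((patient, g_j))
    return edges


def greedy_drivers(edges):
    """Rank mutated genes by greedily covering the outlier events."""
    uncovered = set().union(*edges.values())
    ranking = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical); this is
        # an illustrative choice, not something stated in the text.
        best = max(sorted(edges), key=lambda g: len(edges[g] & uncovered))
        gained = edges[best] & uncovered
        ranking.append((best, len(gained)))
        uncovered -= gained
    return ranking
```

Each entry of the returned ranking pairs a candidate gene with the number of previously unexplained outlier events it covers, so genes selected early explain the most events.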

Datasets 

For our analysis, we used four publicly available datasets 
that contain genome and transcriptome data of several 
tumor types (Table 1). Detailed descriptions of the analysis 
of the datasets and pre-processing workflows can be found 
in Additional file 1. The GBM dataset represents copy 
number, mutations and expression data for 120 glioblas- 
toma multiforme patients [6] taken from the TCGA portal 
[19]. Note that only the cases that had both mutation and
copy number data were included in this dataset. The
METABRIC dataset [20] represents copy number altera- 
tions and accompanying gene expression data for 997 
breast cancer patients. TN represents the validated muta- 
tions, copy number, and expression data for 66 triple 
negative breast cancer patients [11]. The TCGA HGS 
dataset contains mutations, copy number, and expression 
data for 304 high-grade serous ovarian cancer patients [7] 
that were taken from the TCGA portal. As with the GBM
dataset, we included only the cases that had both
mutation and copy number data. The data analysis workflow
is shown schematically in Additional file 2. The
GBM2, TN2, and HGS2 datasets contain mutation (no copy
number) and gene expression data for 140, 66, and 307 glioblastoma,
triple negative breast, and high-grade serous ovarian cancer
patients, respectively.

Performance benchmarking analysis establishes DriverNet 
as a sensitive and specific algorithm 

In practice, quantitative measurements with standard sen- 
sitivity/specificity benchmarking techniques are impracti- 
cal in the absence of ground truth. However, due to the 
availability of well-studied cancer gene databases, including
the Cancer Gene Census (CGC) [21] and the Catalogue
of Somatic Mutations in Cancer (COSMIC) [22],
we set out to approximate performance metrics and com- 
pare DriverNet with the following two competing meth- 
ods: i) a method described by Masica and Karchin [14], 
which uses correlation-based statistics followed by a Fisher 
exact test to associate mutations with gene expression patterns
(referred to as 'Fisher', see Additional file 1); and ii) a
method described in Youn and Simon [5], which identifies 
driver genes based on the background mutation rate, func- 
tional impact on proteins, and redundancy in genetic code 
(referred to as 'Frequency'). In adherence to both 
approaches mentioned above, we removed copy number 
data from the analysis and restricted the comparisons to 
mutation data only (GBM2, TN2, and HGS2, Table 1), 
resulting in the exclusion of the METABRIC dataset as it 
contained copy number aberration data only. We used 
two systematic benchmarking measures as follows: 
i) examining the proportion of predictions found in the 
Cancer Gene Census (CGC) database [21]; ii) examining 
the prevalence of somatic mutations of candidate genes in 
accordance with the COSMIC database, assuming genes 
with higher mutation prevalence in the corresponding 
patient population of interest in COSMIC (glioblastoma, 
breast and ovarian cancer) are more likely to be driver 
genes. Theoretically, this measure should favor the Fre- 
quency approach. 
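
The first benchmarking measure can be made concrete with a short sketch. It assumes that each method outputs a ranked list of candidate genes and that the reference database is available as a set of gene symbols; the function name `concordance_curve` is hypothetical.

```python
def concordance_curve(ranked_genes, reference_genes):
    """Fraction of the top-N ranked genes found in a reference set, for each N.

    ranked_genes: candidate genes ordered from most to least significant
    reference_genes: set of gene symbols from a curated database (e.g. CGC)
    """
    hits = 0
    curve = []
    for n, gene in enumerate(ranked_genes, start=1):
        if gene in reference_genes:
            hits += 1
        curve.append(hits / n)  # proportion of the top n that is in the reference
    return curve
```

Plotting this curve for each method over the same range of N gives a comparison of the kind summarized in Figure 2A-C.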

Figure 2 DriverNet performance benchmarking with the GBM2, TN2, and HGS2 datasets. (A-C) Concordance with Cancer Gene Census
for DriverNet, Frequency-based, and Fisher-based approaches as a function of the top N ranked genes (out of 200) for the GBM2, TN2, and
HGS2 datasets, respectively. (D-F) Concordance with the COSMIC database (cumulative distribution of mutation prevalence in the COSMIC
database) for DriverNet, Frequency-based, and Fisher-based approaches as a function of the top N ranked genes (out of 200) for the GBM2, TN2,
and HGS2 datasets, respectively. Note that for the GBM2 dataset, DriverNet nominates 113 genes as candidate drivers, therefore, the
concordance of DriverNet genes with the Cancer Gene Census is plotted for the 113 candidates.

To systematically evaluate specificity, we compared the
proportion of predictions that were present in CGC as a
function of decreasing sensitivity thresholds (Figure 2A,
B, C) for all three methods. We also looked at the cumulative
distribution of mutation prevalence in the COSMIC
database for all three datasets (Figure 2D, E, F). Through- 
out the range of the top predictions output by DriverNet, 
the concordance with CGC was always higher than for 
Fisher and Frequency in the GBM2 and TN2 datasets. For 
HGS2, DriverNet and the Frequency approach outper- 
formed the Fisher method. The cumulative prevalence in 
the COSMIC dataset was higher for DriverNet compared 
to the other two approaches throughout the range of the 
top predictions, with Frequency second best. Thus, far 
fewer predictions are required by DriverNet to capture the 
majority of drivers in the dataset, indicating higher relative 
specificity. 

For GBM2 (mutations only), the Frequency method 
identified eight genes: EGFR, IDH1, NF1, PIK3R1, PTEN, 
RB1, TP53, and FKBP9 as significantly altered, with seven
of these found in CGC (Additional file 3). In total, Driver- 
Net identified 34 genes (p < 0.05) including seven of the 
genes nominated by the Frequency-based approach (Additional
file 4). Several genes found in CGC (PIK3C2G,
MDM2, BCR, ERBB2, DDIT3, FGFR1, BRCA2, MET, and
PDGFRA) were also among the top 34 genes nominated 
by DriverNet. DriverNet ranked MET, which was reported in [1],
29th (p = 0.002, mutated in three cases), whereas the Frequency
method ranked this gene 93rd, suggesting that it has been
overlooked by frequency-based analysis.

For TN2 (mutation only, no copy number), the Frequency
method identified five genes: PIK3CA, RB1, TP53,
PTEN, and MYO3A as significantly altered genes by mutation,
of which four were found in CGC (Additional file 5).
In total, DriverNet identified 59 genes with p < 0.05, four 
of which were nominated by the Frequency-based 
approach (Additional file 6). One DriverNet prediction not
identified by the Frequency approach was JAK1 (p =
0, ranked 13th, mutated in one case), which plays a key
role in prolactin signaling, a pathway implicated in breast
cancer [23,24].

For HGS2 (mutation only, no copy number), the Fre- 
quency method identified CSMD3, BRCA1, BRCA2, and 
TP53 as significantly altered genes, three of which were 
found in CGC (Additional file 7). DriverNet identified 
BRCA1, BRCA2, and TP53 in addition to the CGC genes
KRAS, PTEN, KIT, NRAS, RPN1, RB1, PIK3CA, CLTCL1,
ATIC, CREBBP, MET, PPP2R1A, CLTC, CTNNB1, BRAF,
and TSHR (Additional file 8). BRAF, PIK3CA, KRAS, and
NRAS are known oncogenic drivers and emphasize the
power of integrating expression data to nominate
important but infrequently mutated genes. In addition,
the known tumor suppressor gene PTEN was among
the top genes in DriverNet (ranked 11th) but was overlooked
by the Frequency method, which ranked this gene
525th.

Infrequent mutations modulating transcriptional networks 
feature prominently in population level datasets 

We then sought to ascertain the prevalence of rare drivers
in all four datasets overlooked by the Frequency-based
approach to driver prediction. We identified 'infrequent'
significant drivers (p < 0.05) where the gene of interest
was abrogated by mutation or copy number alteration 
(CNA) in < 2% of cases. Due to unknown ground truth 
with respect to actual drivers, we restrict presentation to 
those genes also found in the CGC. This resulted in 22 
genes in METABRIC, 13 genes in HGS, 1 gene in TN, 
and 2 genes in GBM (Table 2). The infrequent drivers in 
METABRIC were PTEN, RB1, MDM2, MYC, CDKN2A,
CLTC, CREBBP, GNAS, EGFR, CCNE1, EP300, CBL, 
PIK3R1, JAK2, TP53, NUP98, PIK3CA, IDH2, KRAS, and 
TRA@. Both PIK3CA (two cases with high-level amplifi- 
cations) and PIK3R1 (two cases with homozygous dele- 
tions) were altered in 0.19% of cases, and yet showed 
evidence of driving expression levels of the connected 
genes to the tails of the expression distribution. Interest- 
ingly, we identified seven cases (0.67%) with homozygous 
deletions in TP53 (locus 17p13.1) coincident with outlying
expression in MAPK and Wnt signaling pathways
(Additional files 9 and 10). Loss of function of TP53
is typically associated with mutation; however, these 
results suggest that in rare cases, homozygous deletions 
may be the mechanism by which TP53 is lost in breast 
cancer. 
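
The selection criteria for infrequent drivers stated above (p < 0.05, altered in fewer than 2% of cases, and present in the CGC) amount to a simple filter, sketched here. The function name `infrequent_drivers` and the tuple layout of the input are assumptions made for illustration.

```python
def infrequent_drivers(candidates, n_cases, reference_genes,
                       alpha=0.05, max_fraction=0.02):
    """Select significant drivers that are altered in few cases.

    candidates: iterable of (gene, p_value, n_altered_cases) tuples
    n_cases: number of cases in the dataset
    reference_genes: set of gene symbols from a curated database (e.g. CGC)
    """
    return [
        gene
        for gene, p_value, n_altered in candidates
        if p_value < alpha                        # statistically significant
        and n_altered / n_cases < max_fraction    # altered in < 2% of cases
        and gene in reference_genes               # restricted to known cancer genes
    ]
```

Restricting the output to a curated reference set mirrors the conservative presentation used here, given that the true drivers are unknown.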

In HGS, we found 13 genes that were infrequent drivers 
also found in CGC (AKT2, KIT, NRAS, RPN1, PIK3CA,
CREBBP, PPP2R1A, ATIC, CLTCL1, MET, MAP2K4, 
ETV1, and EP300) (Table 2). Intriguingly, KIT (1.97% of
cases) and NRAS (0.66% of cases) were detected as drivers
(p = 2E-4 and 9E-4, respectively; Additional files 11 and
12). KIT is mutated at high frequency in melanomas, gastrointestinal
stromal tumors, adult acute myeloid leukemia,
and many other tumor types, and is the
target of the kinase inhibitor imatinib. The mutations in
NRAS (typically associated with melanomas, multiple myelomas,
acute myelogenous leukemia, and thyroid cancer)
were, in both cases, the Q61R hotspot mutation in the
Ras domain. Both the KIT and NRAS mutations were
overlooked as driver mutations by the Frequency-based 
approach (Additional file 7). This illustrates the increased 
sensitivity of DriverNet in identifying infrequent drivers in 
the population. Interestingly, mutations typically asso- 
ciated with lower grade (Type I) ovarian cancers such as 
PIK3CA (0.66% cases mutated) and CTNNB1 (0.6% cases 
mutated) were also nominated as drivers despite having 
extremely low frequency. The two PIK3CA mutations 
were both in well-known, activating hotspots, E545K and 
H1047R. We suggest that these (four separate) cases 
might actually be histologically misdiagnosed ovarian can- 
cers. These cases represent an important anecdote as 
many tumor populations contain rare mutations that cre- 
ate aberrant expression profiles. Type I ovarian cancers 
exhibit considerably different expression profiles com- 
pared to Type II high-grade serous cancers [25]. If indeed 
these cases are non-serous it would be unsurprising, given 
the DriverNet formulation of integration of genomic and 
transcriptomic profiles, that these rare mutations would 
cover many outlier events. In addition, we note that
MAP2K4, with a mutation in one case and homozygous deletions in
two cases, and ETV1, typically known for
gene fusions, are listed amongst the infrequent drivers in
the HGS ovarian data. Finally, we cross-referenced the list
of genes with p < 0.05 against Cheung et al. [26] (a list of genes
with genetic vulnerabilities in cancer cell lines) and noted 
that and CCNE1 overlapped. 

In the TN and GBM datasets, results were sparser. In 
the TN dataset, only one gene was an infrequent driver 
that was also in CGC: JAK1 with a mutation occurring in 
a single case (Table 2). JAK1 associated outliers were 
enriched for EGFR1 signaling (Additional files 13 and 14), 
suggesting that the mutation has downstream effects on 
an important oncogenic signaling network. In the GBM 
dataset, two genes, namely KRAS and AKT1, were infre- 
quent drivers and were also found in CGC. KRAS asso- 
ciated outliers were enriched for MAPK and PDGFR 
signaling and AKT1 outliers were enriched for FoxO 
family signaling (Additional files 15 and 16). AKT activa- 
tion is associated with many malignancies, where AKT 
acts, in part, by inhibiting FoxO tumor suppressors [27]. 
Collectively, investigations of rare drivers in METABRIC, 
HGS, TN, and GBM point out bona fide, but rare driver 
mutations, which would likely be omitted by methods 
examining genomic aberrations by selection or frequency 
analysis. These results indicate that rare driver mutations 
modulating expression networks comprise a meaningful 
component of the landscape of transcriptional variation 
attributed to the somatic genome, and thus should not be 
overlooked in the comprehensive enumeration of driver 
mutations in population-level studies. 
Table 2 The predicted rare drivers

Dataset   Gene    Gband     SNV/Indel  HLAMP  HOMD  Corrected P value  Percent altered
METABRIC  PTEN    10q23.31  0          0      16    0                  1.54
METABRIC  RB1     13q14.2   0          0      16    0                  1.54
METABRIC  MDM2    12q15     0          11     0     0                  1.06
METABRIC  MYC     8q24.21   0          10     0     0                  0.96
METABRIC  CDKN2A  9p21.3    0          0      16    0                  1.54
METABRIC  CLTC    17q23.1   0          16     0     0                  1.54
METABRIC  CREBBP  16p13.3   0          1      2     0                  0.29
METABRIC  GNAS    20q13.32  0          7      0     0                  0.67
METABRIC  EGFR    7p11.2    0          3      1     0                  0.39
METABRIC  CDH1    16q22.1   0          0      16    0                  1.54
METABRIC  CCNE1   19q12     0          6      1     0                  0.67
METABRIC  EP300   22q13.2   0

Genomic copy number changes harboring known
oncogenes simultaneously modulate metabolic pathways 

We next examined patterns of modulated expression asso- 
ciated with drivers occurring within the same high-level 
amplification or homozygous deletion. Surprisingly, we 
noted four examples in the METABRIC and GBM data- 
sets whereby genes proximal to known drivers and within 
the same genomic copy number change exhibited evidence 
for altering the expression of metabolic pathways exclusive 
of known oncogenic or tumor suppressor pathway modu- 
lation (Figure 3). PNMT encodes the phenylethanolamine 
N-methyltransferase enzyme and resides approximately
20 Kb centromeric to ERBB2 with one intervening gene.
ERBB2, amplified in approximately 15-20% of breast can- 
cers, is a well-known, targetable membrane-bound 
growth-factor receptor that is effectively inhibited by tras- 
tuzumab in clinical practice. The proximity of PNMT to 
ERBB2 results in co-amplification of both genes in nearly 
all cases (82/83 cases with high-level amplification of 
ERBB2 (Additional file 10)). PNMT was the top ranked 
driver in our analysis (ERBB2 was rank 3). When we 
examined the outlier genes associated with ERBB2 and 
PNMT, ERBB2-associated outlier genes were, as expected,
enriched for ErbB signaling and EGF signaling pathways.
Figure 3 Simultaneous modulation of metabolic pathways in copy number alterations harboring known oncogenes. EnrichmentMap
[32] diagrams depicting Reactome pathways enriched in the set of outliers associated with pairs of genes that are co-amplified or co-deleted. In
each pair, one gene is a known tumor suppressor or oncogene while the other is a metabolism gene. Pathways are shown as connected nodes
in a graph where the size of the node indicates the number of genes in the pathway. Edges between nodes indicate genes common to both
pathways where the thickness of the edge represents the degree of overlap. In general, little overlap was observed between metabolic drivers
and oncogenic/tumor-suppressor drivers. (A) PNMT and ERBB2 co-amplified genes at the chr17q12 locus in breast cancer. (B) PAK1 and NDUFC2
co-amplified genes at the 11q14 locus in breast cancer. (C) CDKN2A and MTAP co-deleted genes at chr9p21.3 in GBM.

PNMT-associated outliers were enriched for non-oncogenic
macromolecule biosynthesis pathways including
metabolic pathways and tyrosine metabolism (Figure 3A). 
The co-occurring modulation of oncogenic and metabolic 
pathways was also found in other high-level amplifications 
in METABRIC including the 11q14 amplification of PAK1
and NDUFC2 (Additional file 10). PAK1 (27 cases with 
high-level amplifications) shows evidence of driving EGFR 
signaling (Figure 3B) and importantly segregates with a 
poor outcome ER positive subtype as reported in [20]. 
NDUFC2 (30 cases with high-level amplifications), down- 
stream of PAK1 by approximately 660 Kb, encodes an 
NADH dehydrogenase enzyme. Outliers associated with 
NDUFC2 were associated with metabolic pathways and an 
oxidative phosphorylation pathway: a metabolic pathway 
that uses energy released by the oxidation of nutrients to 
produce adenosine triphosphate (Figure 3B). 

A similar pattern of simultaneous modulation of meta- 
bolic pathways by the copy number changes harboring 
known oncogenes was observed in GBM data. The cyclin-dependent
kinase inhibitor CDKN2A and the methylthioadenosine
phosphorylase MTAP are adjacent genes separated by approximately 100
Kb. MTAP (DriverNet rank 3) and
known tumor-suppressor CDKN2A (DriverNet rank 4) are 
known to be co-deleted and they were observed as such in 
our analysis. We observed 53 cases with homozygous deletions
in CDKN2A with accompanying co-deletion of
MTAP in all cases (Additional file 16). In two additional 
cases with CDKN2A point mutations, MTAP was not
found to be mutated or deleted. The enriched pathways of
the CDKN2A-associated outliers included cell cycle, p53
signaling, and the FOXM1 transcription factor network
amongst others. The only significantly enriched pathway of
MTAP-deletion associated outliers was the metabolic 
pathway (Figure 3C). 

We examined PNMT-, NDUFC2-, and MTAP-associated
outlying genes that were part of metabolic pathways and
also ERBB2-, PAK1-, and CDKN2A-associated outlying
genes that were related to the oncogenic/tumor suppressor
pathways. Outlying genes related to metabolic pathways
and oncogenic/tumor suppressor pathways were distributed
across disparate loci in the genome, ruling out co-amplification
as the cause of the observed signals (Additional
file 17).

The results of metabolic genes being co-aberrated with 
oncogenic and tumor suppressor genes suggest strongly 
that at least a portion of metabolic pathway disruption in 
cancer can be mechanistically attributed to somatic aberra- 
tions in the genome. Moreover, our results indicate the 
intriguing possibility that genomic aberrations harboring 
known oncogenic/tumor suppressor drivers are being 
selected for due to oncogenic pathway modulation coupled 
with non-overlapping metabolic pathway modulation. 

Discussion 

A major challenge in large-scale interrogation of genomic 
and transcriptomic profiles of tumor types is to contex- 
tualize genomic aberrations within their gene expression 
profiles. Assessing the impact of a somatic mutation on
the expression networks of a tumor provides strong evi- 
dence for its status as a driver. We presented a novel 
algorithm called DriverNet for integrative analysis of 
genomic and transcriptomic data derived from popula- 
tion-level studies of tumors. DriverNet associates the pre- 
sence of a mutated gene with its impact on the gene 
expression levels of its known interacting partners. We 
showed in several cancer datasets that this approach is 
both sensitive and specific with respect to known driver 
genes and is suitable for application in population-level 
datasets for numerous tumor types that will rapidly 
emerge in the coming years. 

Investigation of infrequent drivers revealed a surpris- 
ing number of rare mutations in known cancer genes 
typically associated with other cancers. Although infre- 
quent, they nonetheless modulate the expression profiles 
and their identification is critical to understanding the 
pathogenesis of the cancers that harbor them. We sug- 
gest that examination of genomic patterns in the popu- 
lation without the integration of the transcriptome 
would likely result in overlooking these important, but 
rare drivers. The structure of the bipartite graph induces 
an interplay between the influence graph, the frequency 
of mutations, and the frequency of aberrant expression. 
A natural question that arises is the role of both fre- 
quency of mutation and node degree in the ranking of 
the output. Additional files 18 and 19 show that while 
rank is correlated with both frequency and node degree, 
the relationship is not monotonic and therefore the 
structure of the graph does not deterministically order 
the output. This suggests instead that simultaneous 
observations in the genome and the transcriptome in 
many cases override the structure induced by the influ- 
ence graph and mutation frequency and can therefore 
penetrate the seemingly deterministic structure induced 
by the initial bipartite graph. 

Finally, we describe a set of aberrations whereby prox- 
imal drivers appear to simultaneously modulate onco- 
genic and metabolic pathways. This was observed in 
both breast cancer and GBM datasets and leaves open 
the possibility that selection of well-known drivers such 
as ERBB2 and EGFR may be synergistically acting on 
altered metabolic processes abrogated by co-altered, 
nearby metabolism genes. In light of recent renewed
interest in studying altered metabolism in cancer [28]
owing to IDH1/2 somatic mutations in AML and GBM,
the compound effects of single genomic events on
metabolic and oncogenic pathways suggest that disruption of
metabolic pathways by somatic mutations may be more
widespread than previously thought and provide an
impetus for novel therapies that might restore normal
metabolic function in a cancer-cell-specific manner.



Limitations 

The DriverNet algorithm has some limitations. As outly- 
ing expression is computed in a deterministic manner, 
we may not be capturing less extreme but nonetheless 
important changes in expression that are modulated by 
a genomic event. Furthermore, DriverNet does not
gracefully handle the directionality of the expression
change. A probabilistic model would account for
subtler changes in expression; however, the
combinatorial complexity of inference required in a fully
probabilistic framework remains a daunting and
unresolved challenge because of the number of parameters
to estimate. Thus, this remains an open problem. In
addition, DriverNet relies on the genomic aberrations 
including mutations and extreme copy number altera- 
tion events that are supplied to the algorithm. The 
threshold to determine what constitutes a significant 
copy number alteration lies within third-party copy 
number analysis algorithms and can affect DriverNet 
results. Performance benchmarking suggests that, in
most cases, DriverNet performs better when only
extreme copy number alterations, that is, high-level
amplifications and homozygous deletions, are included
in the analysis (Additional file 20). Reducing the
thresholds to detect more copy number alterations (such as
chromosome-arm level events) results in too large a 
space of altered genes in a given dataset (Additional files 
21, 22, 23, 24). 

The DriverNet framework relies on a predetermined 
influence graph that is undoubtedly sparse and incom- 
plete. This is underscored by the omission in the 
METABRIC dataset of ZNF703, which resides in the 
amplification of the 8p12 locus that includes FGFR1.
We have recently described ZNF703 as a driver [29] in 
luminal B cancers; however, DriverNet was not posi- 
tioned to identify it due to its absence in the Reactome 
database. There are undoubtedly other false negative 
predictions due to poor characterization and lack of 
protein-protein interaction data; however, as interaction 
databases increase in density and volume of interactions, 
the DriverNet framework will be well placed to leverage 
such improvements. Nevertheless, our goal is not to dis- 
cover new protein interactions in this work, but rather 
to describe the association of mutations and expression 
in the context of well-understood knowledge bases. 
Finally, we note that this framework is suitable for data- 
sets with many patients sequenced. Ultimately, we wish 
to extend the framework for application to individual 
patients to determine the effectiveness of identification 
of actionable driver mutations for clinical use. This will 
require the accumulation of large gene expression repo- 
sitories for tumor types that can be used to contextua- 
lize a patient's expression and mutational profiles. 

Conclusions

We have presented a comprehensive analysis from four 
independent datasets of how transcriptional networks are 
affected by genomic aberrations in cancer and demonstrate 
how integrative analysis can be used effectively to identify 
novel driver genes in population-level studies of tumor 
genomes and transcriptomes. Our results demonstrate the 
power of integrative analysis across multiple tumor types 
in recently generated population-scale datasets in revealing 
infrequent, but functionally important, mutations and 
novel patterns of pathway disruption in cancer. We expect 
DriverNet to generalize well to planned future studies, 
including application to patient-specific mutational and 
expression profiles for genome/transcriptome-informed 
personalized cancer care. 

Methods 

In this section we present the essential details of the 
DriverNet algorithm. Additional details of data analysis, 
data preprocessing, and the Fisher method are presented 
in Additional file 1. 

Details of DriverNet algorithm 

Consider two gene-patient matrices. The first, M(i, j), is a
binary matrix where M(i, j) = 1 indicates that gene i is
mutated in patient j and M(i, j) = 0 indicates the
absence of a mutation. Mutations can take the form of
somatic point mutations, indels, copy number changes, or
possibly epigenomic events. Matrix G(i, j) captures the
real-valued gene expression measure of gene i in patient j
and can be derived from gene expression arrays or RNA-Seq.
Optionally, G(i, j) can be transformed into a matrix
G'(i, j) indicating whether gene i in patient j is an outlier
from the population-level distribution for that gene. Given
these matrices, we can formulate the problem of finding
driver mutations with a bipartite graph, B (Figure 1C),
where nodes on the left represent genomic aberration
status from M (green nodes show the genes that have a
mutation in at least one patient) and nodes on the right are
patient-gene events from G or G' (for every patient, outliers
are shown as red nodes). Edges are drawn between nodes
in different partitions of the graph under the following
conditions: for each patient p_k, draw an edge between node
g_i in the left partition and g_j for patient p_k in the right
partition if g_i is mutated, g_j exhibits outlying expression,
and g_i and g_j interact according to known gene networks
(for example, Reactome FI [30]), termed the influence graph
after [18].
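
The edge rule above can be illustrated with a minimal Python sketch. It is not the DriverNet R implementation: the function name, the dictionary-based inputs, and the toy gene and patient labels are all hypothetical, chosen only to show how M, G', and the influence graph combine into the bipartite graph.

```python
from collections import defaultdict


def build_bipartite_graph(mutated, outlier, influence):
    """Map each mutated gene to the (patient, gene) outlier events it links to.

    mutated:   dict patient -> set of mutated genes (entries of M equal to 1)
    outlier:   dict patient -> set of genes with outlying expression (G')
    influence: dict gene -> set of interacting genes (the influence graph)
    """
    edges = defaultdict(set)
    for patient, mutated_genes in mutated.items():
        outlying = outlier.get(patient, set())
        for g_i in mutated_genes:
            # Keep only outliers in this patient that interact with g_i.
            for g_j in influence.get(g_i, set()) & outlying:
                edges[g_i].add((patient, g_j))
    return dict(edges)


# Toy data (hypothetical labels, not taken from any dataset in the paper).
mutated = {"p1": {"A", "B"}, "p2": {"A"}}
outlier = {"p1": {"X", "Y"}, "p2": {"Y", "Z"}}
influence = {"A": {"X", "Y"}, "B": {"Z"}}
graph = build_bipartite_graph(mutated, outlier, influence)
```

In this toy input, gene B is mutated in p1 but its only interacting partner, Z, is not an outlier in p1, so B receives no edges and does not appear in the result.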

The aim of the inference algorithm is to identify genes
in the left partition that are connected to the most nodes
in the right partition (for example, g_2 as shown in Figure
1C), thereby identifying mutated genes with the largest
extent of transcriptional disruption, and simultaneously
implicating a network of connected genes in the influence
graph with outlying expression that associate with
the mutation. The genes are ranked according to their
node coverage in the bipartite graph, B. If we denote the
set of all the mutated genes by U, we postulate that the
top n driver gene set $D_n \subseteq U$ is the set of n genes
that cover the maximum number of nodes in the right
partition of the bipartite graph. It should be noted that: i) due
to different factors, not all the outlying expression events
may be explained by the given mutations; and ii) the
algorithm formulation makes the strong assumption that
drivers will modulate the expression of many genes,
which will primarily apply to genes that alter large,
well-defined transcriptional networks. Finally, we observe that
solving this problem is closely related to the minimum
set cover problem, which is NP-hard.

A greedy approximation algorithm to solve the 
optimization problem 

Given a set of elements (called the universe) and some 
sets whose union comprises the universe, the set cover 
problem is to identify the smallest number of sets whose 
union still contains all elements in the universe. The ana- 
logy of the minimum set cover problem to our driver 
mutation framework is as follows: i) elements of the uni- 
verse are the patient-gene (outlying expression) events, 
and ii) each mutation corresponds to a set that consists 
of those patient-gene events connected to this mutation 
in the bipartite graph. The greedy algorithm for our pro- 
blem is similar to that for the set cover problem: at each 
stage, choose a mutated gene that contains the largest 
number of uncovered outlying expression events (see 
Algorithm 1). The stopping condition is when all the 
connected outlying expression events are covered. In 
other words, the algorithm looks for the minimum
covering of all of the elements in the universe. It can be
shown that the greedy algorithm achieves an approximation
ratio of H(s), where s is the size of the largest set and
$H(n) = \sum_{k=1}^{n} 1/k$ is the nth harmonic number.
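
As an illustration of the greedy selection described above, the following Python sketch repeatedly picks the mutated gene covering the most still-uncovered outlying expression events. It is a sketch under stated assumptions rather than the published implementation: the function name and input format are hypothetical, ties are broken alphabetically for reproducibility (the algorithm as described picks randomly), and each gene is reported with the number of events it newly covers.

```python
def greedy_drivers(graph):
    """Rank mutated genes by greedy coverage of outlying expression events.

    graph: dict gene -> set of events (any hashable items) linked to that gene.
    Returns a list of (gene, newly_covered_count) in selection order.
    """
    remaining = {gene: set(events) for gene, events in graph.items()}
    ranked = []
    while True:
        # Gene with the most uncovered events; alphabetical tie-break (assumption).
        best = max(sorted(remaining), key=lambda g: len(remaining[g]), default=None)
        if best is None or not remaining[best]:
            break  # every connected event is covered
        covered = remaining.pop(best)
        ranked.append((best, len(covered)))
        # Remove the covered events from all other genes.
        for events in remaining.values():
            events -= covered
    return ranked


# Toy example with hypothetical genes A, B, C and events 1..4.
example = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4}}
ranking = greedy_drivers(example)
```

Here A is selected first because it covers three events; B and C then tie on the single remaining event, and B is chosen by the alphabetical tie-break, after which C covers nothing new and is not selected.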

Algorithm 1 Greedy driver gene selection algorithm

Require: B = (V_L, V_R, E) is the bipartite graph, where V_L denotes the set of nodes corresponding to mutated genes, V_R denotes the set of nodes
corresponding to the patient-specific outlying expression events, and E denotes the set of edges between V_L and V_R

1: D ← ∅ //the set of selected driver genes
2: Z ← number of nodes in V_R with at least one edge //the number of all the connected outlying expression events
3: z ← 0 //the number of covered outlying expression events so far
4: while z < Z do
5:   g ← the node in V_L with the highest degree //pick mutated gene with the highest degree; in case of a tie, randomly pick one of the genes
6:   z ← z + degree of g //update the number of covered outlying events
7:   D ← D ∪ {g} //add g to the driver set
8:   S ← the nodes in V_R connected to g //the outlying expression events covered by g
9:   for g' ∈ S do
10:    //remove the node g' and its connected edges from B
11:  end for
12: end while
13: return D

Significance tests

The statistical significance of the driver genes is assessed
using a randomization framework. The original datasets
are permuted N = 500 times, the algorithm is run on
the N randomly generated datasets, and the results on real
data are assessed to see whether they are significantly different
from the results on the randomized datasets. This is an
indirect way of perturbing the bipartite graph corresponding to
the original problem. To generate the random datasets, we
permute both the patient-mutation, M, and patient-outlier,
G', matrices according to the following procedure:
i) construct a J × K zero matrix, where J represents the
number of patients and K represents the total number of
Ensembl 54 protein-coding genes; ii) put 1 in N_total
randomly selected cells, where N_total represents either the
total number of mutations or the total number of outlying
genes, depending on which matrix is permuted; iii) remove
the columns whose elements are all 0. Using the same
influence graph, the algorithm is run on the N = 500
permuted patient-mutation, M_1, ..., M_N, and patient-outlier,
G'_1, ..., G'_N, matrices.
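
The three-step randomization procedure can be sketched in Python as follows. This is only an illustration under assumptions: the function name, the list-of-lists matrix layout (patients as rows, genes as columns), and the explicit random number generator argument are hypothetical choices, not details from the DriverNet package.

```python
import random


def random_binary_matrix(n_patients, n_genes, n_total, rng):
    """Scatter n_total ones in an n_patients x n_genes zero matrix,
    then drop columns that contain no ones.

    Returns (matrix, kept_columns), where kept_columns lists the original
    column indices that survived step iii.
    """
    # Step i: zero matrix with patients as rows and genes as columns.
    matrix = [[0] * n_genes for _ in range(n_patients)]
    # Step ii: choose n_total distinct cells uniformly at random.
    for cell in rng.sample(range(n_patients * n_genes), n_total):
        row, col = divmod(cell, n_genes)
        matrix[row][col] = 1
    # Step iii: keep only columns with at least one non-zero entry.
    kept = [c for c in range(n_genes) if any(row[c] for row in matrix)]
    return [[row[c] for c in kept] for row in matrix], kept


rng = random.Random(0)  # fixed seed so the sketch is reproducible
permuted, kept_columns = random_binary_matrix(4, 10, 6, rng)
```

Sampling distinct cells preserves the total number of events in the matrix while discarding which patient and gene each event belonged to.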

Suppose D is the result of the driver mutation discovery
algorithm. D contains a ranked list of driver genes with
their corresponding node coverage in the bipartite graph,
B. The statistical significance of a gene g ∈ D with a
corresponding node coverage, COV_g, is the fraction of times
that we observe driver genes with node coverage of
more than COV_g in the N = 500 random runs of the
algorithm:

$$\mathrm{pvalue}(g) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{S_i} \delta\left[\mathrm{COV}_{g_{ij}} > \mathrm{COV}_g\right]}{\sum_{i=1}^{N} S_i}$$

where S_i is the number of drivers identified in the ith
run of the algorithm, g_ij is the jth driver identified in that
run, and δ[·] equals 1 when its condition holds and 0
otherwise. We then use the Benjamini-Hochberg approach for
correcting the P values for multiple tests.
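
The empirical P value and the subsequent multiple-testing correction can be sketched as below. The sketch assumes the P value is the fraction of all drivers found across the random runs whose node coverage exceeds that of the observed gene; the function names and input layout are hypothetical, and the Benjamini-Hochberg step is the standard step-up adjustment rather than code taken from the DriverNet package.

```python
def empirical_p_values(observed, random_runs):
    """observed: dict gene -> node coverage on the real data.
    random_runs: list of lists; each inner list holds the node coverages of
    the drivers found in one randomized run (must not all be empty).
    """
    pooled = [cov for run in random_runs for cov in run]
    total = len(pooled)
    return {
        gene: sum(c > cov for c in pooled) / total
        for gene, cov in observed.items()
    }


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values as dict gene -> value."""
    genes = sorted(p_values, key=p_values.get)
    m = len(genes)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        gene = genes[rank - 1]
        running_min = min(running_min, p_values[gene] * m / rank)
        adjusted[gene] = running_min
    return adjusted


# Toy numbers (hypothetical): two observed genes, three randomized runs.
observed = {"A": 10, "B": 3}
random_runs = [[4, 2], [5, 1], [3, 3]]
p = empirical_p_values(observed, random_runs)
p_adjusted = benjamini_hochberg(p)
```

In the toy input, no random driver exceeds the coverage of A, so its P value is 0, whereas two of the six random drivers exceed the coverage of B, giving 2/6.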

Building the influence graph 

The influence graph captures the knowledge about the 
influence of mutation in a gene on the change of expres- 
sion of another gene. Various sources of information such 
as the protein-protein interaction (PPI) networks or net- 
works based on copy number and/or expression data can 
be used to build the influence graph. In this paper, we uti- 
lize the protein functional interaction network derived in 
[30] to build the influence graph. This network extends
the protein functional interaction network in curated
pathways with non-curated sources of information, including
protein-protein interactions, gene co-expression, protein 
domain interaction, gene ontology (GO) annotations, and 
text-mined protein interactions, which cover close to 50% 
of the human proteome. 

Implementation 

The DriverNet algorithm is implemented in a publicly
available R package [31]. The memory complexity of the
greedy algorithm is O(MN + MR + R^2), where M is the
number of patients, N is the number of mutated genes,
and R is the number of genes that have gene expression
values and are also in the influence graph. The algorithm needs
memory to hold the patient-mutation matrix, the
patient-outlier matrix, and the influence graph. Note that all
three matrices are sparse binary matrices; thus, the
memory usage can be decreased by using a sparse representation
of the matrices. If we rank all the mutated genes, the time
complexity is O(δ × N(N + 1)/2), where δ is the time used
to compute the outliers explained by a gene, which is
bounded by its node degree in the influence graph. In
practice, the algorithm is fast when the memory usage
is low. For example, for the GBM dataset, it takes about
1 minute to run on a dual-core desktop Mac computer
without computing the empirical P values.
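
To make the remark on sparse representation concrete, the following Python sketch stores a binary matrix as one set of non-zero column indices per row. It is purely illustrative: the function name and the toy matrix are hypothetical, and the R package may use a different sparse format.

```python
def to_sparse_rows(matrix):
    """Store a binary matrix as one set of non-zero column indices per row."""
    return [{j for j, value in enumerate(row) if value} for row in matrix]


# Toy 3 x 4 patient-by-gene matrix with three non-zero entries.
dense = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
sparse = to_sparse_rows(dense)
stored_entries = sum(len(row) for row in sparse)
dense_cells = len(dense) * len(dense[0])
```

Only the non-zero entries are stored, so the memory needed grows with the number of mutation or outlier events rather than with the full patients-by-genes product.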

Additional material 



Additional file 1: Supplementary text 
Additional file 2: Data analysis workflow 

Additional file 3: Ranked list of candidate driver genes using the 
Youn-Simon approach for the GBM2 dataset. rank: rank of the gene, 
hgnc_symbol: gene symbol, p.value: P value, p.adjust: adjusted P value 
using the Benjamini-Hochberg approach. 

Additional file 4: Ranked list of candidate driver genes for the 
GBM2 dataset. rank: rank of the gene according to DriverNet, gene: 
gene symbol, gband: gene chromosome location and gene band, SNV.Indel:
number of cases with SNV or indel in that specific gene, HLAMP:
number of cases with copy number high-level amplifications, AMP: 
number of cases with copy number amplifications, HOMD: number of 
cases with copy number homozygous deletions, HETD: number of cases 
with copy number hemizygous deletions, covered events: the number of 
events (edges) connected to the gene on the left of the bipartite graph, 
node degree: the number of genes connected to the gene of interest in 
the influence graph, p.value: P value corrected for the multiple test using 
the Benjamini-Hochberg approach, CGC.status: Cancer Gene Census
(CGC) membership status (1 = found in CGC, 0 = not in CGC), 
percentage.event: percentage of cases with genomic aberrations in the 
gene of interest, p.way: top pathways associated with outlying genes 
(posterior probability > 0.8); numbers in parentheses show the posterior 
probability. 

Additional file 5: Ranked list of candidate driver genes using the 
Youn-Simon approach for the TN2 dataset. rank: rank of the gene,
hgnc_symbol: gene symbol, p.value: P value, p.adjust.BH: adjusted P value
using the Benjamini-Hochberg approach.

Additional file 6: Ranked list of candidate driver genes for the TN2 
dataset. rank: rank of the gene according to DriverNet, gene: gene 
symbol, gband: gene chromosome location and gene band, SNV.Indel: 
number of cases with SNV or indel in that specific gene, HLAMP: number 
of cases with copy number high-level amplifications, AMP: number of 
cases with copy number amplifications, HOMD: number of cases with 
copy number homozygous deletions, HETD: number of cases with copy 
number hemizygous deletions, covered events: the number of events 
(edges) connected to the gene on the left of the bipartite graph, node 
degree: the number of genes connected to the gene of interest in the 
influence graph, p.value: P value corrected for the multiple test using the 
Benjamini-Hochberg approach, CGC.status: Cancer Gene Census (CGC) 
membership status (1 = found in CGC, 0 = not in CGC),
percentage.event: percentage of cases with genomic aberrations in the gene of
interest, p.way: top pathways associated with outlying genes (posterior 
probability > 0.8); numbers in parentheses show the posterior probability. 

Additional file 7: Ranked list of candidate driver genes using the 
Youn-Simon approach for the HGS2 dataset. rank: rank of the gene,
hgnc_symbol: gene symbol, p.value: P value, p.adjust: adjusted P value 
using the Benjamini-Hochberg approach. 

Additional file 8: Ranked list of candidate driver genes for the 
HGS2 dataset. rank: rank of the gene according to DriverNet, gene: 
gene symbol, gband: gene chromosome location and gene band, SNV.Indel:
number of cases with SNV or indel in that specific gene, HLAMP:
number of cases with copy number high-level amplifications, AMP: 
number of cases with copy number amplifications, HOMD: number of 
cases with copy number homozygous deletions, HETD: number of cases 
with copy number hemizygous deletions, covered events: the number of 
events (edges) connected to the gene on the left of the bipartite graph, 
node degree: the number of genes connected to the gene of interest in 
the influence graph, p.value: P value corrected for the multiple test using 
the Benjamini-Hochberg approach, CGC.status: Cancer Gene Census 
(CGC) membership status (1 = found in CGC, 0 = not in CGC), 
percentage.event: percentage of cases with genomic aberrations in the 
gene of interest, p.way: top pathways associated with outlying genes 
(posterior probability > 0.8); numbers in parentheses show the posterior 
probability. 

Additional file 9: Ranked list of candidate driver genes for the 
METABRIC dataset. rank: rank of the gene according to DriverNet, gene: 
gene symbol, gband: gene chromosome location and gene band, SNV.Indel:
number of cases with SNV or indel in that specific gene, HLAMP:
number of cases with copy number high-level amplifications, AMP: 
number of cases with copy number amplifications, HOMD: number of 
cases with copy number homozygous deletions, HETD: number of cases 
with copy number hemizygous deletions, covered events: the number of 
events (edges) connected to the gene on the left of the bipartite graph, 
node degree: the number of genes connected to the gene of interest in 
the influence graph, p.value: P value corrected for the multiple test using 
the Benjamini-Hochberg approach, CGC.status: Cancer Gene Census 
(CGC) membership status (1 = found in CGC, 0 = not in CGC),
percentage.event: percentage of cases with genomic aberrations in the
gene of interest, p.way: top pathways associated with outlying genes 
(posterior probability > 0.8); numbers in parentheses show the posterior 
probability. 

Additional file 10: Figure showing the SNVs/indels, homozygous 
deletion (HOMD), and high-level amplification (HLAMP) status 
across the patients for the top 190 candidate driver genes (ranked 
from top to bottom) for the METABRIC dataset. Genes with P values
< 0.05 are shown. Red blocks show HLAMPs and blue show HOMDs for 
each case. 

Additional file 11: Ranked list of candidate driver genes for the 
HGS dataset. rank: rank of the gene according to DriverNet, gene: gene 
symbol, gband: gene chromosome location and gene band, SNV.Indel: 
number of cases with SNV or indel in that specific gene, HLAMP: number 
of cases with copy number high-level amplifications, AMP: number of 
cases with copy number amplifications, HOMD: number of cases with 
copy number homozygous deletions, HETD: number of cases with copy 
number hemizygous deletions, covered events: the number of events 
(edges) connected to the gene on the left of the bipartite graph, node 
degree: the number of genes connected to the gene of interest in the 
influence graph, p.value: P value corrected for the multiple test using the 
Benjamini-Hochberg approach, CGC.status: Cancer Gene Census (CGC) 
membership status (1 = found in CGC, 0 = not in CGC),
percentage.event: percentage of cases with genomic aberrations in the gene of
interest, p.way: top pathways associated with outlying genes (posterior 
probability > 0.8); numbers in parentheses show the posterior probability. 

Additional file 12: Figure showing the SNVs/indels, homozygous 
deletion (HOMD), and high-level amplification (HLAMP) status 
across the patients for the top 144 candidate driver genes (ranked 
from top to bottom) for the HGS dataset. Genes with P values < 0.05
are shown. Green blocks show SNVs or indels, red blocks show HLAMPs, 
and blue show HOMDs for each case. 

Additional file 13: Ranked list of candidate driver genes for the TN 
dataset. rank: rank of the gene according to DriverNet, gene: gene 
symbol, gband: gene chromosome location and gene band, SNV.Indel: 
number of cases with SNV or indel in that specific gene, HLAMP: number 
of cases with copy number high-level amplifications, AMP: number of 
cases with copy number amplifications, HOMD: number of cases with 
copy number homozygous deletions, HETD: number of cases with copy 
number hemizygous deletions, covered events: the number of events 
(edges) connected to the gene on the left of the bipartite graph, node 
degree: the number of genes connected to the gene of interest in the 
influence graph, p.value: P value corrected for the multiple test using the 
Benjamini-Hochberg approach, CGC.status: Cancer Gene Census (CGC) 
membership status (1 = found in CGC, 0 = not in CGC),
percentage.event: percentage of cases with genomic aberrations in the gene of
interest, p.way: top pathways associated with outlying genes (posterior 
probability > 0.8); numbers in parentheses show the posterior probability. 

Additional file 14: Figure showing the SNVs/indels, homozygous 
deletion (HOMD), and high-level amplification (HLAMP) status 
across the patients for the top 50 candidate driver genes (ranked 
from top to bottom) for the TN dataset. Genes with P values < 0.05 
are shown. Green blocks show SNVs or indels, red blocks show HLAMPs, 
and blue show HOMDs for each case. 

Additional file 15: Ranked list of candidate driver genes for the 
GBM dataset. rank: rank of the gene according to DriverNet, gene: gene 
symbol, gband: gene chromosome location and gene band, SNV.Indel: 
number of cases with SNV or indel in that specific gene, HLAMP: number 
of cases with copy number high-level amplifications, AMP: number of 
cases with copy number amplifications, HOMD: number of cases with 
copy number homozygous deletions, HETD: number of cases with copy 
number hemizygous deletions, covered events: the number of events 
(edges) connected to the gene on the left of the bipartite graph, node 
degree: the number of genes connected to the gene of interest in the 
influence graph, p.value: P value corrected for the multiple test using the 
Benjamini-Hochberg approach, CGC.status: Cancer Gene Census (CGC) 
membership status (1 = found in CGC, 0 = not in CGC),
percentage.event: percentage of cases with genomic aberrations in the gene of
interest, p.way: top pathways associated with outlying genes (posterior
probability > 0.8); numbers in parentheses show the posterior probability. 

Additional file 16: Figure showing the SNVs/indels, homozygous 
deletion (HOMD), and high-level amplification (HLAMP) status 
across the patients for the top 49 candidate driver genes (ranked 
from top to bottom) for the GBM dataset. Genes with P values <0.05 
are shown. Green blocks show SNVs or indels, red blocks show HLAMPs, 
and blue show HOMDs for each case. 

Additional file 17: Circos plots showing outlying genes related to 
metabolic pathways for PNMT (A), NDUFC2 (B), and MTAP (C) and 
outlying genes related to oncogenic/tumor suppressor pathways 
for ERBB2 (D), PAK1 (E), and CDKN2A (F) genes. 

Additional file 18: Frequency of aberrations versus the rank of 
significant genes (p < 0.05) for the GBM (A), HGS (B), TN (C), and 
METABRIC (D) datasets. 

Additional file 19: Node degree in the influence graph versus the 
rank of significant genes (p < 0.05) for the GBM (A), HGS (B), TN (C), 
and METABRIC (D) datasets.

Additional file 20: DriverNet performance benchmarking on GBM, 
TN, HGS, and METABRIC datasets when copy number amplifications 
(AMP) and hemizygous deletions (HETDs) were included in addition 
to the high-level amplifications (HLAMP) and homozygous 
deletions (HOMDs). (A-D) Concordance with Cancer Gene Census for 
DriverNet, Frequency-based, and Fisher-based approaches as a function 
of the top N ranked genes (out of 200) for the GBM, TN, HGS, and 
METABRIC datasets, respectively. (E-H) Concordance with COSMIC 
database (cumulative distribution of mutation prevalence in the COSMIC 
database) for DriverNet, Frequency-based, and Fisher-based approaches 
as a function of the top N ranked genes (out of 200) for the GBM, TN, 
HGS, and METABRIC datasets, respectively. 

Additional file 21: Ranked list of candidate driver genes for the 
METABRIC dataset when copy number amplifications and 
hemizygous deletions were included in addition to the mutations, 
high-level amplifications, and homozygous deletions. rank: rank of
the gene according to DriverNet, gene: gene symbol, gband: gene 
chromosome location and gene band, SNV.Indel: number of cases with 
SNV or indel in that specific gene, HLAMP: number of cases with copy 
number high-level amplifications, AMP: number of cases with copy 
number amplifications, HOMD: number of cases with copy number 
homozygous deletions, HETD: number of cases with copy number 
hemizygous deletions, covered events: the number of events (edges) 
connected to the gene on the left of the bipartite graph, node degree: 
the number of genes connected to the gene of interest in the influence 
graph, p.value: P value corrected for the multiple test using the 
Benjamini-Hochberg approach, CGC.status: Cancer Gene Census (CGC) 
membership status (1 = found in CGC, 0 = not in CGC),
percentage.event: percentage of cases with genomic aberrations in the gene of
interest, p.way: top pathways associated with outlying genes (posterior 
probability > 0.8); numbers in parentheses show the posterior probability. 

Additional file 22: Ranked list of candidate driver genes for the 
HGS dataset when copy number amplifications and hemizygous 
deletions were included in addition to the mutations, high-level 
amplifications, and homozygous deletions. rank: rank of the gene
according to DriverNet, gene: gene symbol, gband: gene chromosome 
location and gene band, SNV.Indel: number of cases with SNV or indel in 
that specific gene, HLAMP: number of cases with copy number high- 
level amplifications, AMP: number of cases with copy number 
amplifications, HOMD: number of cases with copy number homozygous 
deletions, HETD: number of cases with copy number hemizygous 
deletions, covered events: the number of events (edges) connected to 
the gene on the left of the bipartite graph, node degree: the number of 
genes connected to the gene of interest in the influence graph, p.value: 
P value corrected for the multiple test using the Benjamini-Hochberg 
approach, CGC.status: Cancer Gene Census (CGC) membership status (1 = 
found in CGC, 0 = not in CGC), percentage.event: percentage of cases
with genomic aberrations in the gene of interest, p.way: top pathways 
associated with outlying genes (posterior probability > 0.8); numbers in 
parentheses show the posterior probability. 



Additional file 23: Ranked list of candidate driver genes for the TN 
dataset when copy number amplifications and hemizygous 
deletions were included in addition to the mutations, high-level 
amplifications, and homozygous deletions. rank: rank of the gene
according to DriverNet, gene: gene symbol, gband: gene chromosome 
location and gene band, SNV.Indel: number of cases with SNV or indel in 
that specific gene, HLAMP: number of cases with copy number high- 
level amplifications, AMP: number of cases with copy number 
amplifications, HOMD: number of cases with copy number homozygous 
deletions, HETD: number of cases with copy number hemizygous 
deletions, covered events: the number of events (edges) connected to 
the gene on the left of the bipartite graph, node degree: the number of 
genes connected to the gene of interest in the influence graph, p.value: 
P value corrected for the multiple test using the Benjamini-Hochberg 
approach, CGC.status: Cancer Gene Census (CGC) membership status (1 = 
found in CGC, 0 = not in CGC), percentage.event: percentage of cases 
with genomic aberrations in the gene of interest, p.way: top pathways 
associated with outlying genes (posterior probability > 0.8); numbers in 
parentheses show the posterior probability. 

Additional file 24: Ranked list of candidate driver genes for the 
GBM dataset when copy number amplifications and hemizygous 
deletions were included in addition to the mutations, high-level 
amplifications, and homozygous deletions. rank: rank of the gene
according to DriverNet, gene: gene symbol, gband: gene chromosome 
location and gene band, SNV.Indel: number of cases with SNV or indel in 
that specific gene, HLAMP: number of cases with copy number high- 
level amplifications, AMP: number of cases with copy number 
amplifications, HOMD: number of cases with copy number homozygous 
deletions, HETD: number of cases with copy number hemizygous 
deletions, covered events: the number of events (edges) connected to 
the gene on the left of the bipartite graph, node degree: the number of 
genes connected to the gene of interest in the influence graph, p.value: 
P value corrected for the multiple test using the Benjamini-Hochberg 
approach, CGC.status: Cancer Gene Census (CGC) membership status (1 = 
found in CGC, 0 = not in CGC), percentage.event: percentage of cases 
with genomic aberrations in the gene of interest, p.way: top pathways 
associated with outlying genes (posterior probability > 0.8); numbers in 
parentheses show the posterior probability. 



Abbreviations 

AMP: amplifications; CGC: cancer gene census; CNA: copy number alteration; 